Research Problem

There are many challenges that prevent existing data from being found and reused. Hence, understanding how researchers and support professionals discover and use data may facilitate future data reuse. Additionally, analyzing the factors limiting researchers’ ability to reuse data can help research support professionals better understand how to assist researchers looking for data.

Introduction

The data for our project come from a globally distributed survey whose goal was to investigate data reuse and sharing habits at a large scale. The data set contains 1677 responses from 105 countries and 31 unique disciplines.

The collected data consist only of categorical variables, although several survey questions also included an open-ended response.

The types of analyses we carried out included bar charts, histograms, pie charts, classification trees, word clouds, etc.

The remainder of this report is structured as follows:

• In Section 1, we examine and present visuals on the support data set, covering basic exploratory data analysis (EDA) of the most important variables.

• In Section 2, we cover the researchers data set.

• In Section 3, the final section, we show and explain all analyses comparing the two data sets (support vs. researchers).

Some of our major findings are that respondents would rather encourage data sharing than data reuse, that a greater percentage of respondents from the Middle East and Africa indicated they discouraged data sharing than in any other region, and that data is used most often as the basis for a new study.

Citation for the data:

Gregory. (2020). Data discovery and reuse practices of researchers and research support professionals [Data set]. DANS-EASY. DOI.

Citation for the additional publication that used the data:

Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2019). Lost or found? Discovering data needed for research. arXiv preprint arXiv:1909.00464.

General Overview of Researchers and Support Dataset

The data contain 1677 total responses from 105 countries and 31 disciplines.

The support dataset contains 47 respondents and 167 variables. Respondents include librarians, archivists, and research/data support providers. The researchers dataset contains 1630 respondents and 165 variables. Respondents include researchers, students, managers, and “other”, for which individuals indicated their role in the open-response portion. Open-response answers included professors, engineers, educators, physicians, etc.

Note that the support dataset has a relatively small sample size, so any conclusions or takeaways from it should be interpreted with that in mind.

Analysis of Research and Support Datasets

SECTION 1: Support

The support dataset comprises respondents whose roles include librarian, archivist, and research/data support provider.

We begin with a general overview of the respondents, then progress to more specific analysis of the variables and significance testing where relevant.

Who do the respondents support? (L4)

The following shows the count of the people whom the respondents support:

Variable            Count
whosupprt_stud      35
whosupprt_res       44
whosupprt_indus     7
whosupprt_oth       9
whosupprt_othresp   NA

Most respondents support researchers (44), students (35), or both.

Respondent demographics (experience, discipline, country/geographical region)

The following analyses examine the demographics of respondents from the support dataset (experience, discipline, and country).

Years of experience (D2)

Almost half of the respondents in the support dataset have 6-15 years of experience. It should be noted that there are only 3 respondents with 31+ years of experience in this dataset.

Discipline of specialization (D1)

Based on the support dataset, the most common disciplines are part of natural and applied sciences, such as information science, environmental sciences, and social science. This may be because new research data within the science field may not be as readily available compared to other disciplines.

Country and geographical region of work (D3)

The most common countries of employment among respondents in the support data are the USA and the United Kingdom. When the countries are grouped by continent, Europe and North America are the most common, in that order.

Data Need (L7)

The following examines what kinds of data the research support professionals need:

Data need overview

##    need_open     need_obs     need_exp     need_sim   need_deriv     need_oth 
##           32           39           19           18           24            8 
## need_othresp 
##            8
## 
##     need_obs    need_open   need_deriv     need_exp     need_sim     need_oth 
##   0.26351351   0.21621622   0.16216216   0.12837838   0.12162162   0.05405405 
## need_othresp 
##   0.05405405
Based on the bar graph, observational or empirical data was needed the most among the respondents in the support data.
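The second block of output above can be reproduced from the first: each count is divided by the total number of selections (148) and the result is sorted in decreasing order. The following is a minimal Python sketch of that arithmetic, not the code used for the report (the printed output appears to come from R). The names `need_counts` and `shares` are illustrative assumptions.

```python
# Counts copied from the first block of the "Data need overview" output.
need_counts = {
    "need_open": 32, "need_obs": 39, "need_exp": 19, "need_sim": 18,
    "need_deriv": 24, "need_oth": 8, "need_othresp": 8,
}


def shares(counts):
    """Return each count as a share of all selections, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {name: count / total for name, count in ranked}


need_shares = shares(need_counts)
for name, share in need_shares.items():
    print(f"{name:<14}{share:.4f}")
```

These are shares of all selections, not of respondents, so they sum to 1 across the variables. Because `need_othresp` appears to be the free-text companion of `need_oth`, including both in the denominator may count the same respondents twice; whether that is intended depends on how the survey variables were coded.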

Data need and respondent demographics

Conclusions:

  1. In relation to data needs, the researchers and support datasets are very similar: observational or empirical data is the most needed type in the majority of demographic categories in both datasets.

  2. Comparing the 4 main data need variables (observational or empirical data, experimental data, simulation data, derived or compiled data) across respondents’ years of experience, we see that the graphs are relatively similar for most categories. However, the percentage needing simulation data in the 31+ years of experience group is substantially greater than in the other 3 experience categories. This may suggest that more experienced respondents are more likely to need simulation data, for example because they need large quantities of data or use simulations for more advanced studies. With only 3 respondents in the 31+ group, this difference rests on very few responses.

Data Use (L8)

The following examines why the respondents or the people they support use or need secondary data, starting with the count of different uses and followed by a graph.

Data use overview

##  use_nwstdy    use_calb use_bnchmrk  use_vrfctn    use_inpt    use_idea 
##          38          12          16          20          19          26 
##     use_tch   use_nwprj   use_nwmth   use_trnds  use_cmprsn    use_smvs 
##          37          29          19          22          28          25 
## use_intgrtn     use_oth use_othresp 
##          27           1           1
The 3 most common purposes for which data is used are as the basis for a new study, for teaching/training, and to prepare for a new project/proposal, in that order.

Data use and respondent demographics

Conclusions:

  1. Comparing the 4 main data use variables (basis for a new study, teaching/training, new project or proposal, compare multiple datasets) to respondents’ years of experience, we see that using data for teaching or training is very common across all experience levels. This is reasonable because it can be assumed that respondents with lower levels of experience may be using data for training while more experienced respondents may be teaching others. When it comes to respondents with 31+ years of experience, they were much more likely to use data for other reasons, such as preparing for a new project, rather than as a basis for a new study.

  2. Comparing data use to respondents’ disciplines, we can observe that using data for teaching or training occupies a large percentage in all of the disciplinary subsets. This may emphasize that data is frequently used to teach or train no matter what discipline one may be in.

Data Find

The following examines how research support professionals find their data.

Data find (L10: how frequently do you find data in the following ways?)

##    find_actonln find_serendsrch  find_serendpas      find_share      find_netwk 
##              47              47              47              47              33 
##     find_creatr     find_collab       find_conf       find_list 
##              12              14              29              31
The most common way the respondents find their data is by actively searching online.

Conclusion:

The most common way respondents find their data is by actively searching online, as shown by the first column in the graph above. Although finding data serendipitously happens less often, it is still an occasional occurrence.

Data find (L11: find data via what source)

The following examines the sources through which the respondents discover their data:

## 
##  find_netwk   find_list   find_conf find_collab find_creatr 
##   0.2773109   0.2605042   0.2436975   0.1176471   0.1008403
The most common source used to discover data is conversations with personal networks, followed by mailing lists or forums and then conferences.

Conclusion:

When respondents were asked how they or the people they support discover data, the most common source was conversations with personal networks, followed by mailing lists or forums and then attending conferences.

Data find and respondent demographics

Conclusions:

  1. Looking at the ways respondents found data by experience level, we see that almost half of those with 0-5 years of experience found their data through conversations with personal networks. However, as experience increased, the percentage of data found via personal networks decreased dramatically. This may suggest that more experienced respondents are more likely to look for and find data outside their initial network, for example through mailing lists or conferences.

  2. Comparing the individual graphs, we can see that finding data via conversations with personal networks is relatively common across all disciplines.

  3. Outside North America and Europe, respondents seem to find data through a limited number of sources. For example, none of the respondents from Australia/New Zealand and South/Central America indicated that they find data by attending conferences or developing collaborations. However, the limited number of responses from those regions should be taken into consideration.

Conclusions:

  1. The most used sources are government sources, academic literature, and search engines. The least used are commercial sources.

Tests for significance

## 
##  1-sample proportions test with continuity correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reusegrp, null probability 0.5
## X-squared = 3.7812, df = 1, p-value = 0.05183
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4986377 0.8325051
## sample estimates:
##      p 
## 0.6875
## 
##  1-sample proportions test with continuity correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reusedisc, null probability 0.5
## X-squared = 3.7812, df = 1, p-value = 0.05183
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4986377 0.8325051
## sample estimates:
##      p 
## 0.6875
## 
##  1-sample proportions test with continuity correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reuseorg, null probability 0.5
## X-squared = 1.3611, df = 1, p-value = 0.2433
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4352665 0.7637567
## sample estimates:
##         p 
## 0.6111111
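To make the three test results above easier to check, here is a minimal Python sketch of a one-sample proportion test with continuity correction, including the continuity-corrected Wilson interval. It is an illustration written for this report, not the code that produced the output (which appears to come from R's `prop.test`). The counts are not printed in the output: 22 of 32 and 22 of 36 are assumptions, chosen because they reproduce the printed estimates (0.6875 and 0.6111) and test statistics. The function name `prop_test` is also illustrative.

```python
import math
from statistics import NormalDist


def prop_test(successes, n, p0=0.5, conf_level=0.95):
    """One-sample proportion test with continuity correction.

    Returns (chi-squared statistic, two-sided p-value, (lower, upper)),
    where the interval is the Wilson score interval with continuity correction.
    """
    estimate = successes / n
    deviation = abs(successes - n * p0)
    # The correction is capped so that it never exceeds the observed deviation.
    yates = min(0.5, deviation)
    statistic = (deviation - yates) ** 2 / (n * p0 * (1 - p0))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))

    z = NormalDist().inv_cdf((1 + conf_level) / 2)
    z22n = z * z / (2 * n)

    def bound(sign):
        # sign = -1 gives the lower limit, sign = +1 the upper limit.
        p_c = estimate + sign * yates / n
        if sign < 0 and p_c <= 0:
            return 0.0
        if sign > 0 and p_c >= 1:
            return 1.0
        half_width = z * math.sqrt(p_c * (1 - p_c) / n + z22n / (2 * n))
        return (p_c + z22n + sign * half_width) / (1 + 2 * z22n)

    return statistic, p_value, (bound(-1), bound(+1))


# Assumed counts: 22 of 32 and 22 of 36 reproduce the printed estimates.
results = {(x, n): prop_test(x, n) for x, n in [(22, 32), (22, 36)]}
for (x, n), (statistic, p_value, (lower, upper)) in results.items():
    print(f"{x}/{n}: X-squared = {statistic:.4f}, p-value = {p_value:.4f}, "
          f"95% CI = ({lower:.4f}, {upper:.4f})")
```

Read against a 5% significance level, none of the three tests rejects the null proportion of 0.5. The first two are borderline (p-value of about 0.052, with a confidence interval that just includes 0.5), and the third is clearly not significant (p-value of about 0.24).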

SECTION 2: Researchers

In this section, we investigate the researchers data set, which contains survey answers from researchers, students, managers, and individuals who described their roles in the open response, including professors, educators, physicians, and engineers, to name a few.

This section begins with some overviews of the respondents in the data set, followed by more specific analysis through investigating certain variables grouped by the demographic characteristics of the respondents.

Role of respondents (Q1)

Most of the respondents from this dataset are researchers.

Demographics of respondents (experience, discipline, country/geographical region)

Years of Experience

As in the support dataset, the most common experience bracket is 6-15 years, followed by 16-30 years.

Discipline of specialization (D1)

Similar to the support dataset, respondents who specialize in the natural and applied sciences greatly outnumber the other disciplinary subsets.

Country and geographical region of work (D3)

The most common country of employment in this dataset is the USA, and the most common continents are Europe, North America, and Asia, in that order.

Data need (Q3)

We explore the types of data respondents need in the barplots below.

The greatest area of need for respondents is observational or empirical data, followed by experimental data.

We then group respondents’ data need by their discipline groups. [Note, depending on how the disciplines are grouped, the results of the barplot could change].

Data Need and Respondent Disciplines

Observational or empirical data is the most needed type in every discipline group, and its share is highest in the Social Sciences, where over 50% of respondents need that type of data.

Observational or empirical data appears to be the most needed type, followed by experimental data, across all of the discipline groups. Overall, the trends in data need are as expected for each discipline group: respondents in the natural sciences have the greatest proportion needing experimental data, respondents in business have the greatest proportion needing simulation (model) data, and respondents in the humanities have the greatest proportion needing derived or compiled data. As mentioned above, these results may differ if the original survey disciplines were grouped differently.

We then explore trends in who finds data for the respondents.

Who finds data (Q6)

Over 50% of respondents find data themselves, and nearly 25% find data through someone in their personal network.

We then split by experience group below.

Who finds data by experience group

The majority of researchers in all experience groups find data themselves, followed by finding data through someone in their personal network. Excluding the “other” category, the least common answer in the 0-5 experience group is graduate students, while in the other experience groups it is research support professionals.

Across all experience groups, respondents indicated that they most often find data themselves. The percentage who find data themselves decreases as experience increases, from nearly 60% of respondents in the 0-5 group to less than 50% of the 31+ group. This may indicate that more experienced respondents can outsource data finding to research assistants and graduate students, or that they are more aware of support professionals who can help find data.
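The percentages compared in this paragraph are within-group shares: each answer is divided by the number of respondents in its own experience group, so groups of different sizes can be compared. The following is a minimal Python sketch of that calculation. The records are invented for illustration only and are not survey responses, and the name `within_group_shares` is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical (experience group, who finds the data) pairs, invented for
# illustration only; they are not survey responses.
records = [
    ("0-5", "self"), ("0-5", "self"), ("0-5", "network"),
    ("6-15", "self"), ("6-15", "support"), ("6-15", "network"), ("6-15", "self"),
    ("31+", "self"), ("31+", "grad student"),
]


def within_group_shares(pairs):
    """Share of each answer inside each group (each group's shares sum to 1)."""
    counts = defaultdict(Counter)
    for group, answer in pairs:
        counts[group][answer] += 1
    return {
        group: {answer: n / sum(c.values()) for answer, n in c.items()}
        for group, c in counts.items()
    }


table = within_group_shares(records)
for group, row in table.items():
    print(group, {answer: round(share, 2) for answer, share in row.items()})
```

Normalizing within each group matters here because the experience groups differ widely in size, so raw counts would mostly reflect how many respondents each group contains.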

We see similar levels of finding data through someone in their personal network and through research support professionals across all experience groups.

Next, we explore the challenges to finding data.

Challenges to finding data (Q11a)

Data not being accessible is the most common challenge to finding data, followed by data being in many different places.

We then split by experience group to better understand the challenges experienced at all experience levels.

Challenges to finding data by Experience group

In all experience groups, “data are not accessible” is the most common challenge among respondents, followed by “data are in many different places”. Excluding the “other” category, not having the necessary personal networks appears to be the least common challenge throughout the experience groups. A smaller percentage of respondents in the 31+ experience group considers data not being accessible a challenge compared to the other groups.

The challenges to finding data appear similar across all experience groups, with data not being accessible and data being in many different places as the first and second most common challenges. Nearly 30% of respondents in the 0-5 experience group indicated that data not being accessible is a challenge, compared with less than 25% of respondents in the 31+ group. In contrast, the percentage of respondents indicating that data are in many different places increases with experience. This trend may indicate that the more experience a respondent has, the easier it becomes to navigate resources to find data, while that greater access reveals that the necessary data are inconveniently spread across many places. It could point to a need for better resources that make data accessible to less experienced researchers, and for better consolidation or organization of data.

We explore some of the challenges by isolating the respondents who indicated they face a certain challenge and plotting other variables. The most interesting trends are shown below.

Relationship between data source and respondents who indicated they did not know where or how to find data [note that some sources did not show much difference and were not included in the plots. Only sources which had interpretable trends are shown below.]:

First, we isolated respondents who indicated they don’t know where or how to find data and compared their frequency of use of consultation with research support, data specific search engines, discipline-specific data repositories and general search engines with the frequency of use of the entire respondent group.

A smaller proportion of respondents who listed not knowing where or how to find data as a challenge use consultation with research support, data-specific search engines, and discipline-specific data repositories.

A greater percentage of respondents in the “don’t know where or how to look for data” group indicated they never use consultation with research support than in the entire group. The percentage answering “never” for data-specific search engines and discipline-specific data repositories is also higher in this group than in the entire group, and the percentage using these sources often is lower. These changes suggest that respondents whose challenges include not knowing where or how to find data use targeted sources (i.e. data-specific or discipline-specific search engines and repositories, and support professionals) less than the entire respondent group, and use general search engines such as Google more.

Next we compare the trends in the frequency of using discipline-specific data repositories, multidisciplinary data repositories, governmental agencies and websites, and professional associations between respondents who indicated they think online tools are inadequate and the entire respondent group [note that only sources which had interpretable trends are shown in the plots below]:

Respondents who indicated that inadequate online tools are a challenge to finding data appear to use the following sources more than the entire respondent group.

[Note that since the differences in the plot are slight, the proportions represented in the plot are listed below for a clearer comparison]

Discipline specific data repository: all respondents

## 
##        Never Occasionally        Often 
##        0.333        0.333        0.333

Discipline specific data repository: inadequate online tools

## 
##        Never Occasionally        Often 
##        0.173        0.402        0.425

Multidisciplinary data repositories: all respondents

## 
##        Never Occasionally        Often 
##        0.333        0.333        0.333

Multidisciplinary data repositories: inadequate online tools

## 
##        Never Occasionally        Often 
##        0.289        0.501        0.210

Government agencies and websites: all respondents

## 
##        Never Occasionally        Often 
##        0.333        0.333        0.333

Government agencies and websites: inadequate online tools

## 
##        Never Occasionally        Often 
##        0.180        0.455        0.365

Professional associations: all respondents

## 
##        Never Occasionally        Often 
##        0.333        0.333        0.333

Professional associations: inadequate online tools

## 
##        Never Occasionally        Often 
##        0.365        0.461        0.175

It appears that respondents who indicated that inadequate online tools are a challenge to finding data use discipline-specific data repositories, multidisciplinary data repositories, government agencies and websites, and professional associations more than the entire respondent group. This could point to those sources lacking adequate online tools for researchers to find data.

Note that other sources, such as general search engines, data-specific search engines, and code repositories, were not included because there was very little difference in frequency of use between all respondents and those who indicated inadequate online tools as a challenge.
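The "all respondents" tables above print exactly 0.333 for every level. One way such a uniform split can arise is when the shares are computed over the distinct answer levels rather than over the individual responses. This is an assumption, since the code behind the tables is not shown. The Python sketch below illustrates the difference on invented responses. The names `LEVELS` and `level_shares` and the data are hypothetical.

```python
from collections import Counter

LEVELS = ["Never", "Occasionally", "Often"]


def level_shares(values):
    """Share of each frequency level among the given values, rounded to 3 places."""
    counts = Counter(values)
    total = sum(counts[level] for level in LEVELS)
    return {level: round(counts[level] / total, 3) for level in LEVELS}


# Hypothetical responses, invented for illustration only.
responses = ["Often"] * 5 + ["Occasionally"] * 3 + ["Never"] * 2

over_responses = level_shares(responses)    # shares over individual responses
over_levels = level_shares(set(responses))  # shares over the distinct levels only

print(over_responses)
print(over_levels)
```

With these invented responses, the first call gives 0.2, 0.3, and 0.5, while the second gives 0.333 for every level regardless of how the responses are distributed. If the baseline tables were produced in the second way, the comparison with the "inadequate online tools" group would need to be recomputed from the individual responses.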

Ease of finding data (Q11)

We compare the ease of finding data for students and researchers. We also look at the challenges students and researchers face when finding data.

Ease of finding data for students

## 
##             Difficult                  Easy Sometimes challenging 
##                     9                     3                    61
## 
##             Difficult                  Easy Sometimes challenging 
##                0.1233                0.0411                0.8356

Ease of finding data for researchers

## 
##             Difficult                  Easy Sometimes challenging 
##                   260                   124                   988
## 
##             Difficult                  Easy Sometimes challenging 
##                0.1895                0.0904                0.7201

A barplot of the levels of ease in finding data is shown below:

A greater share of students than researchers indicated that data is sometimes challenging to find (84% of students compared to 72% of researchers). However, more researchers (19%) than students (12%) indicated data is difficult to find, and a greater proportion of researchers (9%) than students (4%) indicated that finding data is easy.
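The percentages quoted here follow directly from the two count tables above. The following minimal Python sketch reproduces them and adds the percentage-point difference between the groups. The names `ease_counts` and `to_shares` are illustrative assumptions, and the code is not the code used for the report.

```python
# Counts copied from the "Ease of finding data" tables for students and researchers.
ease_counts = {
    "students": {"Difficult": 9, "Easy": 3, "Sometimes challenging": 61},
    "researchers": {"Difficult": 260, "Easy": 124, "Sometimes challenging": 988},
}


def to_shares(counts):
    """Convert the counts of one group into shares of that group's total."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


ease_shares = {role: to_shares(counts) for role, counts in ease_counts.items()}

for level in ease_counts["students"]:
    s = ease_shares["students"][level]
    r = ease_shares["researchers"][level]
    # Difference is students minus researchers, in percentage points.
    print(f"{level:<22} students {s:6.1%}  researchers {r:6.1%}  "
          f"difference {100 * (s - r):+5.1f} pts")
```

The student shares rest on 73 responses, compared with 1372 for researchers, so the student percentages move by more than a percentage point with each additional response.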

We explore the challenges students and researchers indicated below.

Challenges for students and researchers

Most students and researchers indicate that data are not accessible, followed by data being in many different places.

There do not appear to be dramatic differences in the challenges to finding data faced by students and researchers. For both students and researchers, the greatest challenge is that data are not accessible, followed by data being in many different places. Interestingly, the percentage of researchers who indicated they don’t have the necessary personal networks or don’t know where or how to look for data is lower than that of students. This makes sense, as researchers may have more experience than students and would thus be expected to have a richer personal network and more skill in finding data.

The proportion of researchers who indicated that online tools are inadequate and that data are not digital is greater than that of students. These challenges could indicate that resources for finding data have not kept up with the transition of data to digital formats or online platforms.

We then look at the frequency of using certain sources to find data.

Data source (Q7)

Academic literature and general search engines appear to be the most popular data sources with nearly 80% of respondents indicating they often use academic literature and nearly 70% of respondents indicating they often use general search engines. Code repositories and commercial sources appear to be the least popular data sources with just under 80% and just under 70% of respondents indicating they never use those sources respectively.

We now explore the frequency of use of certain data sources by respondent experience level [note that some sources are omitted because there was no interesting difference between the trends in use of all the respondents compared to respondents of different experience levels].

Frequency of using Academic Literature to find data:

As experience level increases, the frequency of using Academic Literature increases as well, from under 60% of respondents in the 0-5 group indicating they often use Academic Literature to nearly 80% of respondents in the 31+ group.

Frequency of using Consultation with Research Support to find data:

In the 0-5 and 6-15 experience groups, similar percentages of respondents indicated they never or occasionally use Consultation with Research Support to find data, while in the 16-30 and 31+ groups a greater percentage indicated they occasionally use this source than indicated they never use it.

Frequency of using General Search Engine (e.g. Google) to find data:

The greatest percentage of respondents who indicated they often use General Search Engines (e.g. Google) to find data occurred in the 0-5 group, while the percentage of respondents in the other groups who often use this source remains roughly the same at just under 60%.


Broken down by experience group, the frequency of using these sources to find data does not differ greatly from the trends across all respondents. For Academic Literature, the percentage of respondents who indicated they often use the source increases with experience level, perhaps because of the expertise needed to understand academic literature. For General Search Engines (e.g. Google), the percentage of respondents who indicated they often use the source decreases as experience level increases, perhaps because more experienced respondents know of more effective and reliable sources such as academic literature or discipline-specific resources. For Consultation with Research Support, the percentage of respondents who indicated they occasionally use this source increases with experience level, while the percentage who indicated they never use it decreases. This may be due to greater access to support professionals among more experienced respondents and more confidence in seeking out professional support.

We investigate trends in how data is used below.

Data Use (Q4)

Data is most commonly used for new studies, followed by new projects and then teaching and training. Excluding the “Other” category, data is used least for calibrating instruments or models.


Note that since data is most commonly used for new studies followed by preparing for a new project or proposal and then to generate new ideas, it appears data is mostly used to create or discover new things. This could be taken into account when interpreting the analyses on data sharing and reusing later in the report.

We then split data use by respondent discipline group.

Data use by Respondent discipline

Using data for the basis of a new study appears to be the most popular way to use data across all discipline groups.


Across discipline groups, data is used most as the basis for a new study, with the highest percentage of respondents in the Humanities and Arts group at just over 15%. In the Natural Science group, the second highest percentage of respondents indicated they use data to verify their own data, perhaps signaling the importance of repeated findings for validating the accuracy of a researcher’s data. In the Healthcare and Medicine and Multidisciplinary groups, the second highest percentage indicated using data to prepare for a new project or proposal, while in the CS and Engineering group the second highest percentage indicated using data for model, algorithm and system inputs and to experiment with new methods or techniques.

Respondent perception of data sharing and reusing (D5/D6)

We explore the respondents’ personal perception of data sharing and reusing below:

Personal perception of data sharing

## 
## Data sharing is neither encouraged nor discouraged 
##                                                151 
##               Data sharing is somewhat discouraged 
##                                                 30 
##                Data sharing is somewhat encouraged 
##                                                440 
##               Data sharing is strongly discouraged 
##                                                 10 
##                Data sharing is strongly encouraged 
##                                                972 
##                         Don't know/ Not applicable 
##                                                 27
## 
## Data sharing is neither encouraged nor discouraged 
##                                             0.0926 
##               Data sharing is somewhat discouraged 
##                                             0.0184 
##                Data sharing is somewhat encouraged 
##                                             0.2699 
##               Data sharing is strongly discouraged 
##                                             0.0061 
##                Data sharing is strongly encouraged 
##                                             0.5963 
##                         Don't know/ Not applicable 
##                                             0.0166

Personal perception of data reusing

## 
## Data reusing is neither encouraged nor discouraged 
##                                                214 
##               Data reusing is somewhat discouraged 
##                                                 50 
##                Data reusing is somewhat encouraged 
##                                                517 
##               Data reusing is strongly discouraged 
##                                                 44 
##                Data reusing is strongly encouraged 
##                                                751 
##                         Don't know/ Not applicable 
##                                                 54
## 
## Data reusing is neither encouraged nor discouraged 
##                                             0.1313 
##               Data reusing is somewhat discouraged 
##                                             0.0307 
##                Data reusing is somewhat encouraged 
##                                             0.3172 
##               Data reusing is strongly discouraged 
##                                             0.0270 
##                Data reusing is strongly encouraged 
##                                             0.4607 
##                         Don't know/ Not applicable 
##                                             0.0331
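
The two frequency tables above can be condensed into a single "encouraged" and "discouraged" share for each question. The following is a minimal Python sketch (not the code used to produce the output in this report); the counts are copied from the tables above, while the shortened labels and the helper name `share_of` are our own illustrative choices.

```python
# Counts copied from the two frequency tables above (1630 responses each).
# The shortened answer labels are illustrative, not the survey wording.
sharing = {
    "strongly encouraged": 972,
    "somewhat encouraged": 440,
    "neither": 151,
    "somewhat discouraged": 30,
    "strongly discouraged": 10,
    "don't know / not applicable": 27,
}
reusing = {
    "strongly encouraged": 751,
    "somewhat encouraged": 517,
    "neither": 214,
    "somewhat discouraged": 50,
    "strongly discouraged": 44,
    "don't know / not applicable": 54,
}

ENCOURAGED = ("strongly encouraged", "somewhat encouraged")
DISCOURAGED = ("strongly discouraged", "somewhat discouraged")


def share_of(counts, labels):
    """Fraction of all responses that fall under the given answer labels."""
    total = sum(counts.values())
    return sum(counts[label] for label in labels) / total


for name, counts in (("sharing", sharing), ("reusing", reusing)):
    print(
        name,
        "encouraged:", round(share_of(counts, ENCOURAGED), 4),
        "discouraged:", round(share_of(counts, DISCOURAGED), 4),
    )
```

Under these counts, the encouraged share is larger for sharing than for reusing, which is consistent with the comparison made later in this section.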

Respondents who discourage data sharing

We now look more closely at respondents who indicated that, in their perception, data sharing is “somewhat discouraged” or “strongly discouraged”.

The geographical regions of the 40 respondents who indicated that data sharing is somewhat or strongly discouraged are shown below. The first table gives the proportion of respondents from each region in the entire respondent group for comparison, and the second gives the proportions among these 40 respondents:

## 
##                Africa                  Asia Australia/New Zealand 
##                  0.05                  0.17                  0.02 
##                Europe           Middle East         North America 
##                  0.41                  0.04                  0.19 
## South/Central America 
##                  0.11
## 
##                Africa                  Asia Australia/New Zealand 
##                  0.10                  0.10                  0.03 
##                Europe           Middle East         North America 
##                  0.38                  0.15                  0.17 
## South/Central America 
##                  0.07

A greater percentage of the respondents who indicated that data sharing is discouraged come from the Middle East (15%) and Africa (10%) than the percentage of respondents from those regions in the entire respondent group (4% and 5% respectively). This could potentially indicate regional differences in the perception of data sharing.
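
One way to make this comparison explicit is to divide each region's proportion in the discouraging group by its proportion among all respondents. The sketch below is a Python illustration under two assumptions: that the first table above describes all respondents and the second the 40 discouraging respondents (as the percentages quoted in the text suggest), and that the printed proportions are rounded to two decimals. The function name `representation_ratio` is hypothetical.

```python
# Proportions copied from the two regional tables above.
# Assumption: first table = all respondents, second = the 40 respondents
# who indicated data sharing is somewhat or strongly discouraged.
overall = {
    "Africa": 0.05, "Asia": 0.17, "Australia/New Zealand": 0.02,
    "Europe": 0.41, "Middle East": 0.04, "North America": 0.19,
    "South/Central America": 0.11,
}
discouraging = {
    "Africa": 0.10, "Asia": 0.10, "Australia/New Zealand": 0.03,
    "Europe": 0.38, "Middle East": 0.15, "North America": 0.17,
    "South/Central America": 0.07,
}
N_DISCOURAGING = 40


def representation_ratio(group, baseline):
    """Ratio > 1 means a region is over-represented in the group."""
    return {region: group[region] / baseline[region] for region in baseline}


ratios = representation_ratio(discouraging, overall)
for region, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    # Approximate head count implied by the rounded proportion.
    approx_n = round(discouraging[region] * N_DISCOURAGING)
    print(f"{region:22s} ratio = {ratio:.2f} (about {approx_n} respondents)")
```

Because the discouraging group has only 40 respondents, each region's ratio rests on a handful of people, so these ratios should be read as suggestive rather than conclusive.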

Below is the perception of data sharing in larger scales such as among the work group, disciplinary community and at an organizational level:

It appears that the percentages of respondents who indicate that data sharing is discouraged in their disciplinary community, organization or work group are dramatically lower in the entire survey group than among the respondents who indicated that they personally discourage data sharing. This is not surprising, but the fact that the biggest difference occurs for the work group may indicate that the work those respondents engage in heavily influences their own perception of data sharing. Perhaps these respondents deal with sensitive information in their work, for example at government agencies or secretive companies.

Below is the perception of data reuse on the same levels: work group, disciplinary community and at an organizational level.

Although higher percentages of respondents in the discourage data sharing group indicated they discourage data reuse across all levels, more of them indicated they somewhat encourage or neither encourage nor discourage data reuse, whereas the majority indicated they discourage data sharing. This could indicate that this group of respondents is primarily concerned about sharing data and, more specifically, that they perceive their work group as the least enthusiastic about sharing data.

Perception of sharing and reusing: Mean score (D5/D6)

Overview of the factors

To begin, we will be exploring 4 different factors. The first 2 fall under perception of reuse and sharing, while the next 2 are variables transformed from the given data.

1. avg_share (numerical) refers to the mean sharing score across the 4 types of sharing for each respondent.

2. avg_reuse (numerical) refers to the mean reuse score across the 4 types of reuse for each respondent.

3. disc_total (numerical) refers to the total number of disciplines reported by each individual. It is to be taken as a proxy measure of how multidisciplinary one’s work might be. It is a numerical value from 1 to 20.

4. total_chal (numerical) refers to the total number of challenges reported by individuals when finding data. Each challenge is coded as “1” for “Yes” and “0” for “No”, and the codes are then summed into a new numerical variable, “total_chal”.
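
The construction of these four variables can be sketched as follows. This is a Python illustration, not the code behind this report: the record layout, the discipline and challenge names, and the function `derive_variables` are all hypothetical, and only the variable names avg_share, avg_reuse, disc_total and total_chal come from the definitions above.

```python
from statistics import mean


def derive_variables(respondent):
    """Build the four derived variables for one respondent record."""
    return {
        # Mean of the 4 numeric sharing / reuse perception scores.
        "avg_share": mean(respondent["share_scores"]),
        "avg_reuse": mean(respondent["reuse_scores"]),
        # Sums of 0/1 indicators (1 = selected / "Yes", 0 = "No").
        "disc_total": sum(respondent["disciplines"].values()),
        "total_chal": sum(respondent["challenges"].values()),
    }


# Hypothetical respondent: all values below are made up for illustration.
example = {
    "share_scores": [5, 4, 3, 4],   # self, discipline, group, organization
    "reuse_scores": [4, 4, 3, 3],
    "disciplines": {"Biology": 1, "Computer Science": 1, "Physics": 0},
    "challenges": {"access": 1, "quality": 0, "findability": 1},
}

derived = derive_variables(example)
print(derived)
```

Summing 0/1 indicators treats every discipline and every challenge as equally weighted, which is a simplification worth keeping in mind when interpreting disc_total and total_chal.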

Classification Methods for Demographics

There are many interesting and important demographics that can be explored. In this section, we will only be focused on a few aspects of demographics.

  1. Role of respondents (dem_role) - Researchers, Students. We will focus on these two primary groups of respondents, as they are most relevant to our research questions and nature of work, given that we are helping our clients at CMU Library with researchers and students.

  2. Experience group (dem_exprnce) - These are the 4 experience groups used earlier.

  3. Shared (Y/N) (dem_shared) - This refers to whether the respondents have indicated that they have shared data, with “Yes” classified as “sharers” and “No” as “non-sharers”.

Scatterplot Matrix

To begin, we plot the 4 numerical variables in a scatterplot matrix to identify some of the interesting relationships we might investigate in detail further.

Average Reuse vs Average Share Across all population

We explore the perception of data sharing via a mean “score” by assigning each answer choice a value from 0 to 5 as follows:

Data sharing is strongly encouraged = 5

Data sharing is somewhat encouraged = 4

Data sharing is neither encouraged nor discouraged = 3

Data sharing is somewhat discouraged = 2

Data sharing is strongly discouraged = 1

Don’t know/ Not applicable = 0
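
As an illustration of this scoring, the sketch below applies the mapping to the personal data sharing counts reported earlier in this section and computes a mean score. It is a Python sketch rather than the report's own code, and the helper `mean_score` is a hypothetical name. It also shows how much the mean changes if the “Don't know/ Not applicable” answers, which are scored 0, are left out.

```python
# Score assigned to each answer choice, as listed above.
SCORES = {
    "Data sharing is strongly encouraged": 5,
    "Data sharing is somewhat encouraged": 4,
    "Data sharing is neither encouraged nor discouraged": 3,
    "Data sharing is somewhat discouraged": 2,
    "Data sharing is strongly discouraged": 1,
    "Don't know/ Not applicable": 0,
}

# Counts copied from the personal perception of data sharing table.
sharing_counts = {
    "Data sharing is strongly encouraged": 972,
    "Data sharing is somewhat encouraged": 440,
    "Data sharing is neither encouraged nor discouraged": 151,
    "Data sharing is somewhat discouraged": 30,
    "Data sharing is strongly discouraged": 10,
    "Don't know/ Not applicable": 27,
}


def mean_score(counts, scores, skip_zero=False):
    """Count-weighted mean score; optionally ignore answers scored 0."""
    kept = {a: n for a, n in counts.items() if not (skip_zero and scores[a] == 0)}
    return sum(scores[a] * n for a, n in kept.items()) / sum(kept.values())


with_zero = mean_score(sharing_counts, SCORES)
without_zero = mean_score(sharing_counts, SCORES, skip_zero=True)
print(round(with_zero, 3), round(without_zero, 3))
```

Scoring “Don't know/ Not applicable” as 0 pulls the mean down slightly, so a mean computed this way is a conservative summary of how encouraging respondents are.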

Firstly, we investigate the relationship of average reuse vs average sharing across the entire data set.

Observation: As seen above, as the average sharing score increases, the average reuse score indicated by respondents also increases. This motivates testing the relationship with a linear regression later.

Diving deeper into different methods of analyzing the data.

  1. Roles of the respondents. Focused primarily on Researcher and Student

Roles - (Researcher & Student)

To begin with the roles classification, we will explore the following after classifying respondents by role: (1) how they view sharing personally, (2) how they view sharing across disciplines, (3) how they view reuse personally, and (4) how they view reuse across disciplines.

Avg Reuse vs Avg Share

Observation: There are 1372 responses from researchers and only 73 from students, as evidenced by the density and number of data points in the two separate plots. However, the overall trends are similar in both plots and resemble the overall trend of the population, which indicates a positive relationship between mean sharing values and mean reuse values.

Linear regression modelling by roles
## 
## Call:
## lm(formula = avg_reuse ~ avg_share, data = main.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3297 -0.3304  0.2525  0.6696  2.9991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.00085    0.10336   9.683   <2e-16 ***
## avg_share    0.66577    0.02599  25.616   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9965 on 1628 degrees of freedom
## Multiple R-squared:  0.2873, Adjusted R-squared:  0.2868 
## F-statistic: 656.2 on 1 and 1628 DF,  p-value: < 2.2e-16

We regress average reuse on average sharing using linear regression. The adjusted R-squared is 0.2868.
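
To make the fitted line concrete, the sketch below plugs the reported coefficients into the regression equation and recovers the implied correlation from the reported R-squared. It is a Python illustration that only reuses numbers from the output above; it does not refit the model, and the function name `predict_avg_reuse` is our own.

```python
import math

# Coefficients and R-squared copied from the regression output above.
INTERCEPT = 1.00085
SLOPE = 0.66577
R_SQUARED = 0.2873


def predict_avg_reuse(avg_share):
    """Predicted average reuse score for a given average sharing score."""
    return INTERCEPT + SLOPE * avg_share


# In a simple regression, |correlation| = sqrt(R-squared), with the sign
# of the slope.
correlation = math.copysign(math.sqrt(R_SQUARED), SLOPE)

for avg_share in (1, 3, 5):
    print(avg_share, round(predict_avg_reuse(avg_share), 3))
print("implied correlation:", round(correlation, 3))
```

Each additional point of average sharing score is associated with roughly two thirds of a point of average reuse score, and the implied correlation is moderate rather than strong.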

Conclusions: Firstly, researchers and students show roughly similar trends for sharing individually. In both groups, more than 85% indicate that sharing is somewhat or strongly encouraged. One difference is that for researchers, strongly encouraged is a large majority, almost double that of somewhat encouraged, while for students the split is more even.

Secondly, in the figure for sharing by disciplines, the distribution is more symmetric than in the first set of plots. Comparing researchers and students, the distributions are again roughly similar.

However, when comparing between sharing types, one can see that the strongly encouraged responses almost halve for sharing within disciplines, and more responses are neutral or even discouraged than for sharing personally. This is aligned with the overall trend of the dataset.

For reuse, both researchers and students respond with overwhelming positivity about reusing data personally, as seen in the left skewness of the data. This is encouraging to see, especially as only 10% of both students and researchers are discouraged from reusing data. For students, there is also a very large portion of responses indicating that they were strongly encouraged to reuse data, more than double the portion who felt only somewhat encouraged.

For reuse within disciplines, the distribution is less skewed and more symmetric than that of reuse by self. Overall, the 2 roles do not show a huge difference in how they are discouraged. For researchers, somewhat and strongly discouraged make up less than 10%, and for students they add up to a little more than 10%. However, a far larger percentage of respondents in both groups felt encouraged, and more were neutral about it than for reusing personally.

Main takeaway: The role of the respondents accounts for little difference in the data set, since a large majority are researchers (1372 of the 1445 researcher and student responses, out of 1630 in total). Researchers have the largest influence on the overall trend of the data. Hence, moving on, we do not need to differentiate by role, given that both roles share similar trends on top of the dominant effect of the researcher role.

Experience Group

Moving on, we investigate another aspect of demographics, the level of experience. Firstly, we check the relationship of average reuse vs average sharing within the different experience groups.

Similarly, we will investigate the following: (1) how they view sharing personally, (2) how they view sharing across disciplines, (3) how they view reuse personally, and (4) how they view reuse across disciplines.

Relationship between Average Reuse and Average Share with different experience groups

Conclusion: The overall trend of Average Reuse vs Average Share is roughly similar across all 4 experience groups, with a positive relationship between the 2 variables. The fitted line also shows a similar gradient in each experience group.

Observation: For sharing individually, the portion of respondents who are somewhat or strongly encouraged to share data increases as experience increases. This could be because, as experience increases, respondents are more confident in their data practices and are more willing and encouraged to share their datasets, while less experienced respondents might be less inclined due to fear of criticism or because they simply do not know of avenues or means to share.

However, for sharing within disciplines, the trend is more similar across groups and more symmetric. Respondents are arguably less inclined to reuse by disciplines too.

Observation: For reusing individually, the overall trends of the four groups are similar, with the majority indicating a “strongly or somewhat encouraged” attitude towards reusing data. The trends for reusing by disciplines are also very similar, being more symmetric and less skewed.

One would expect the least experienced group to be more inclined to reuse data than the other experience groups, as inexperienced researchers have not yet refined their research methodology in terms of data collection or reading and thus might feel more inclined to reuse data than more experienced individuals. However, this is not the case: the 31+ group shows a high willingness to reuse data, similar to that of the 0-5 group.

Overall trend: As the experience of an individual increases, they become more inclined to share data as opposed to reusing data. This can be seen in the 31+ group being the least inclined to reuse data out of all the experience groups. This could be because more experienced researchers often specialize in their field or have refined and preferred data collection methodologies. Their standards and requirements for data are often specific and are not as easily replicated.

Therefore, it is probably advisable to encourage reuse of data among less experienced researchers (the 0-5 and 6-15 groups) to allow them to develop their preferred data collection methods. They are more malleable and receptive to data reuse and would also benefit more from learning refined methods of data collection or data extraction.

Conclusion: It is worth investigating experience levels further, and we will try to differentiate between sharers and non-sharers next.

Sharers vs Non-sharers graphs for average reuse and average share

Observation across different experience groups: A similar trend is visible for both sharers and non-sharers, resembling that of the entire population. The same can be said across all experience groups. (Note that the first figure is shown at the start of this section.) This is perhaps due to the high number of sharers, which causes the entire dataset to resemble the sharers.

Investigating Perceptions of Non-sharers split by experience

Next, we investigate the non-sharers, since they are a minority, to get a better idea of what might be deterring them from sharing. We plot their overall responses to the different types of sharing and reuse as bar plots to understand the distribution or relationship, if any.

Conclusion: As one can see, there is a very low percentage of non-sharers who are somewhat or strongly encouraged to share data, ranging from 30-50%. However, the somewhat discouraged and strongly discouraged responses are evidently the majority. This trend is similar for reuse.

There is also a slight positive relationship: as experience increases, the non-sharers are more encouraged to share data. However, the same is not true for reuse. The least experienced and most experienced groups are the most encouraged to reuse data, which is interesting, since one might not expect the most experienced group to be as inclined to reuse data as the least experienced group.

Comparison for Perceptions of Reuse, Sharing Between Sharers and Non-sharers

Following that, we do a side-by-side comparison of sharers and non-sharers to see if there are any significant findings or insights.

Perceptions on Sharing: Individuals report that they are more likely to share data themselves than across group entities (disc, grp, org). This is true for both sharers and non-sharers.

Evidently, individuals who have shared have a higher overall willingness to share data than those who have not shared. This is true across all 4 types of sharing.

Interestingly, willingness to share data is still relatively high among individuals who have not shared before.

Perceptions on Reusing: A trend similar to that for perceptions on sharing is seen here. Individuals report that they are more likely and encouraged to reuse data themselves than in groups (disc, grp, org).

When comparing those who have shared with those who have not, evidently those who have shared had a higher willingness to reuse data across all types.

Comparison

The overall portion of individuals who are somewhat/strongly encouraged to share data is larger than that of individuals who reported the same willingness to reuse data. The overall portion of individuals who are ambivalent or somewhat/strongly discouraged to reuse data is larger than that of sharing data.

This could be because individuals often need specific data sets, or data collected with a specific methodology, to best serve their research. As a result, they might be less willing to reuse prior data if it does not fit their needs as well.

Linear Model for predicting perceptions on self using their perceptions in group

Given that one is more likely to share and reuse data individually than in a group, we investigate whether inclinations in a group setting can be used to predict how willing one is to share/reuse data individually.

## 
## Call:
## lm(formula = dem_sharself ~ dem_shardisc + dem_shargrp + dem_sharorg, 
##     data = main.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7970 -0.4345  0.0007  0.4954  2.4743 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.52567    0.07362  34.307  < 2e-16 ***
## dem_shardisc  0.13227    0.02209   5.989  2.6e-09 ***
## dem_shargrp   0.29242    0.02216  13.198  < 2e-16 ***
## dem_sharorg   0.07003    0.01892   3.701 0.000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8112 on 1626 degrees of freedom
## Multiple R-squared:  0.2992, Adjusted R-squared:  0.2979 
## F-statistic: 231.4 on 3 and 1626 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = dem_reuseself ~ dem_reusedisc + dem_reusegrp + dem_reuseorg, 
##     data = main.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2359 -0.3100  0.0154  0.3646  3.4235 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.611550   0.063344  25.441  < 2e-16 ***
## dem_reusedisc  0.133147   0.022610   5.889 4.71e-09 ***
## dem_reusegrp   0.548475   0.023088  23.756  < 2e-16 ***
## dem_reuseorg  -0.007012   0.018855  -0.372     0.71    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8545 on 1626 degrees of freedom
## Multiple R-squared:  0.5094, Adjusted R-squared:  0.5085 
## F-statistic: 562.7 on 3 and 1626 DF,  p-value: < 2.2e-16

Observation: The group-level reuse perceptions explain more of the variation in individual reuse perception (adjusted R-squared of 0.5085) than the group-level sharing perceptions explain of individual sharing perception (adjusted R-squared of 0.2979). This indicates that perhaps the perception of reusing data in a group setting is a better predictor of reusing data individually than sharing in a group setting is of sharing data individually.

In other words, the sharing perceptions are much more varied.
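
The adjusted R-squared values being compared here can be reproduced from the reported multiple R-squared values using the standard adjustment for the number of predictors. The sketch below is a Python illustration using only numbers from the two model summaries above; the function name `adjusted_r_squared` is our own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Both models report 1626 residual degrees of freedom with 3 predictors
# plus an intercept, which implies 1630 observations.
N_PREDICTORS = 3
N_OBS = 1626 + N_PREDICTORS + 1

# Multiple R-squared values copied from the two summaries above.
sharing_adj = adjusted_r_squared(0.2992, N_OBS, N_PREDICTORS)
reuse_adj = adjusted_r_squared(0.5094, N_OBS, N_PREDICTORS)

print("sharing model:", round(sharing_adj, 4))
print("reuse model:  ", round(reuse_adj, 4))
```

With over 1600 observations and only 3 predictors, the adjustment is tiny, so the gap between the two models reflects a real difference in explained variation rather than a difference in model size.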

Possible Explanations for difference between Individual and Group sharing/reusing

Across all types of classification, a common trend exists in the difference between group and individual sharing/reusing: individual perceptions are more encouraging than those within groups.

This could be due to individuals being clearer and more decisive in their own data practices and therefore able to respond more positively when appropriate. In group settings, the data practices or protocols might differ from those of individuals or even be less clear, as there are differing priorities and projects all the time. The overall data practices and methodology might also encourage less sharing and reuse.

Another possible reason for a lower reuse tendency in group settings is that individuals feel more pressured by their peers and counterparts to collect primary data and not to engage freely in reusing prior data. This leaves us room to explore whether other factors, like data needs, uses or even fields of expertise, affected their inclination.

Other Factors Total Disciplines and Total Challenges

Next, we explore the 2 other new variables that are transformed from the current data set.

Firstly, we investigate the effect of the total number of disciplines (disc_total) on (1) avg_share, (2) avg_reuse, and (3) total challenges.

Secondly, we investigate total challenges and its effect on average reuse, to see if average reuse scores increase when researchers encounter more challenges when finding data.

EDA of Total Discipline

As previously mentioned, disc_total (numerical) refers to the total number of disciplines reported by each individual and is to be taken as a proxy measure of how multidisciplinary one’s work might be. It is a numerical value from 1 to 20. Firstly, we take a look at the distribution of total disciplines.

## 
##   1   2   3   4   5   6   7   8   9  10  15 
## 816 370 235 123  46  18   9   7   2   3   1

Observation: Just over half of the individuals (816 of 1630) are involved in only 1 discipline, with 370 in 2 disciplines and 235 in 3 disciplines. The histogram is strongly right skewed with one mode at 1 discipline.

Evidently, the majority of researchers are not involved in multidisciplinary research. Despite that, we investigate further to see if there are any effects on other factors.

Future work: One could split respondents into 2 main groups, multidisciplinary and single-discipline, for a better split and distribution of the population.
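
A minimal sketch of such a two-group split is shown below, using the frequency table of total disciplines above. It is written in Python for illustration and is not part of the analysis carried out in this report; the group labels and the function `split_single_vs_multi` are hypothetical.

```python
# Frequency table of total disciplines, copied from the output above
# (number of disciplines -> number of respondents).
disc_total_counts = {
    1: 816, 2: 370, 3: 235, 4: 123, 5: 46, 6: 18,
    7: 9, 8: 7, 9: 2, 10: 3, 15: 1,
}


def split_single_vs_multi(counts):
    """Collapse the counts into single-discipline vs. multidisciplinary."""
    single = counts.get(1, 0)
    multi = sum(n for k, n in counts.items() if k > 1)
    return {"single discipline": single, "multidisciplinary": multi}


groups = split_single_vs_multi(disc_total_counts)
total = sum(groups.values())
for label, n in groups.items():
    print(f"{label:18s} {n:5d} ({n / total:.1%})")
```

With this split the two groups are almost the same size, which would give a much more balanced comparison than the long right tail of the original variable.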

EDA of Total Challenges

As previously mentioned, total_chal (numerical) refers to the total number of challenges reported by individuals when finding data. Each challenge is coded as “1” for “Yes” and “0” for “No”, and the codes are then summed into a new numerical variable.

## 
##   0   1   2   3   4   5   6 
## 139 269 398 447 250  88  39

Conclusion: As seen in the histogram, the distribution of total challenges faced by researchers is relatively symmetrical with only a slight left skewness and a primary mode at 3 challenges. The 3 most commonly reported numbers of challenges are 3 (447 respondents), 2 (398 respondents) and 1 (269 respondents).
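
The centre of this distribution can be summarised directly from the frequency table above. The following Python sketch expands the table back into one value per respondent and computes the mean, median and mode; it is an illustration only and uses no data beyond the counts shown above.

```python
from statistics import mean, median, mode

# Frequency table of total challenges, copied from the output above
# (number of challenges -> number of respondents).
total_chal_counts = {0: 139, 1: 269, 2: 398, 3: 447, 4: 250, 5: 88, 6: 39}

# Expand the table into one value per respondent.
values = [k for k, n in total_chal_counts.items() for _ in range(n)]

chal_mean = mean(values)
chal_median = median(values)
chal_mode = mode(values)

print("respondents:", len(values))
print("mean:", round(chal_mean, 2), "median:", chal_median, "mode:", chal_mode)
```

The mean sits a little below the median and mode, which is in line with the roughly symmetric shape described above.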

Bivariate Analysis

For the following parts, we explore our 2 new variables against other variables in various bivariate analyses.

Relationship between Total Disciplines and Average Share

Firstly, we investigate the relationship between average share and total discipline.

## 
## Call:
## lm(formula = avg_share ~ disc_total, data = main.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0887 -0.4184  0.1668  0.6668  1.1668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.80481    0.04118  92.404   <2e-16 ***
## disc_total   0.02839    0.01683   1.686   0.0919 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9494 on 1628 degrees of freedom
## Multiple R-squared:  0.001744,   Adjusted R-squared:  0.001131 
## F-statistic: 2.844 on 1 and 1628 DF,  p-value: 0.09192

Conclusion: There is a small positive relationship when predicting average sharing using total disciplines, but it is not statistically significant at the 5% level (p = 0.0919) and explains almost none of the variance (adjusted R-squared of 0.001131).

Despite that, the effect is more pronounced for non-sharers than for sharers.

This could be because the majority of the data set are sharers and the majority of responses do not report many disciplines, resulting in the plotted line on the right being less steep.

Relationship between Total Disciplines and Average Reuse

After exploring the effect of total number of disciplines on average sharing, we investigate its effect on average reuse and compare to see whether there is any significant difference.

## 
## Call:
## lm(formula = avg_reuse ~ disc_total, data = main.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.756 -0.541  0.321  0.709  1.459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.51034    0.05116  68.617   <2e-16 ***
## disc_total   0.03067    0.02091   1.467    0.143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.18 on 1628 degrees of freedom
## Multiple R-squared:  0.001319,   Adjusted R-squared:  0.0007059 
## F-statistic: 2.151 on 1 and 1628 DF,  p-value: 0.1427

Conclusion and Comparison: With regards to fig __ , there is no significant effect of total number of disciplines on average reuse scores for either group: the slope is not statistically significant (p = 0.143) and the adjusted R-squared value is only 0.0007059.

When comparing, total disciplines explains slightly more of the variation in sharing, given the higher adjusted R-squared value of 0.001131 against 0.0007059. However, all in all, one can still surmise that the correlation between total number of disciplines and the average sharing and reuse scores is low, given the gentle slope of the plotted line and the low adjusted R-squared values.
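
The weakness of both slopes can also be seen from their standard errors. The sketch below recomputes the t statistics from the reported estimates and forms approximate 95% confidence intervals. It is a Python illustration that assumes a normal critical value of 1.96 (reasonable with 1628 residual degrees of freedom) and uses only the coefficient tables above; the function names are our own.

```python
# (estimate, standard error) for the disc_total slope in each model,
# copied from the two regression outputs above.
slopes = {
    "avg_share ~ disc_total": (0.02839, 0.01683),
    "avg_reuse ~ disc_total": (0.03067, 0.02091),
}


def t_statistic(estimate, std_error):
    """t value as reported by the regression summary: estimate / SE."""
    return estimate / std_error


def approx_ci95(estimate, std_error, z=1.96):
    """Approximate 95% interval using a normal critical value (assumption)."""
    return (estimate - z * std_error, estimate + z * std_error)


intervals = {}
for model, (estimate, std_error) in slopes.items():
    intervals[model] = approx_ci95(estimate, std_error)
    low, high = intervals[model]
    print(
        f"{model}: t = {t_statistic(estimate, std_error):.3f}, "
        f"95% CI = ({low:.4f}, {high:.4f})"
    )
```

Both intervals include zero, so neither regression gives evidence that the number of disciplines is related to the average sharing or reuse score.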

Relationship between Total Disciplines and Total Challenges

Next, we might hypothesize that as one is involved in more disciplines, the challenges faced in finding data could become more complex, and thus one might run into more problems. Hence we investigate this.

Observation: There is no strong relationship or correlation observed between the two factors. Given the high density of individuals with a low number of disciplines, there is no significant relationship.

Relationship between total challenges faced and tendency to reuse data

Next, we would like to see if the number of challenges faced while finding data is associated with an increased propensity or willingness to reuse data. This is on the assumption that if one has trouble finding a desired data set, one might be more likely to reuse prior data sets that are more readily available.

Observation: For non-sharers, the relationship between number of challenges and reuse tendency is more defined and positive than for sharers. As non-sharers encountered more challenges, they were more likely to report a higher willingness to reuse data.

However, we should be wary, since the sample size of non-sharers is substantially smaller than that of sharers: only 232 non-sharers against 1398 sharers.

Classification Trees

The classification tree is used to predict whether one has “shared” using 5 variables that we have mentioned earlier. They are as follows: (1) Willingness to share data individually (2) Willingness to share data within disciplines (3) Willingness to reuse data individually (4) Willingness to reuse data within discipline (5) Experience group

Conclusion: The classification tree indicates that respondents who indicated a strong inclination to share themselves ( > 4) are highly likely to have shared (level 1, 85%). This is the most important factor in determining whether they have shared data. Likewise, respondents whose experience is between 0-5 and who have expressed a willingness to share are also predicted to have shared more often than their counterparts (level 2, 23%).

Limitations

Average Score Computation method

Due to the formula used to compute the average score and average reuse, a “0” is assigned to responses that are “Not Asked”. This results in a lower computed score in both cases.
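
The effect can be seen in a small Python sketch that compares the two ways of averaging. The function names and the example respondent are hypothetical; only the "Not Asked is coded as 0" behavior comes from the description above.

```python
def mean_with_zeros(responses):
    """Average where "Not Asked" is coded as 0 (the behavior described above)."""
    scores = [0 if r == "Not Asked" else r for r in responses]
    return sum(scores) / len(scores)

def mean_excluding(responses):
    """Average over answered items only; returns None if nothing was answered."""
    answered = [r for r in responses if r != "Not Asked"]
    return sum(answered) / len(answered) if answered else None

# Hypothetical respondent: three answered items on a 1-5 scale, one not asked.
responses = [4, 5, "Not Asked", 3]
print(mean_with_zeros(responses))  # 3.0
print(mean_excluding(responses))   # 4.0
```

Excluding unasked items from the denominator is one possible alternative; whether it suits this survey depends on why a question was not asked.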

Transformation of variable

With regard to disciplines and challenges, denoting them as binary classifications might not reflect the whole spectrum of information available. For instance, one challenge might pose bigger problems than another, so they are not weighted equally in reality. The same could be said for the complexity or difficulty of the disciplines involved.

Open Response Eval (Q12)

The following examines open response question Q12, asking the respondents to specify any information they consider when deciding whether to use or not to use secondary data:

Conclusion:

According to the frequency of the words that appeared in the open responses, common words include reliability, free/cost, time period/date of data, and relevance. This suggests that researchers tend to have these factors in mind when deciding whether or not to use secondary data.
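
A word-frequency summary of this kind can be sketched in a few lines of Python. The example responses and the stopword list are invented for illustration and are not taken from the survey; the counts behind a word cloud would be produced in a similar way.

```python
import re
from collections import Counter

# Small, hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "is", "it", "if", "in", "for", "or", "data"}

def word_frequencies(responses):
    """Count lowercase words across all responses, skipping stopwords."""
    counts = Counter()
    for text in responses:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

# Hypothetical open responses (not actual survey answers).
responses = [
    "Reliability of the source",
    "Cost, and the time period of the data",
    "Relevance and reliability",
]
freqs = word_frequencies(responses)
print(freqs.most_common(1))  # [('reliability', 2)]
```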

Tests for significance

## 
##  4-sample test for equality of proportions without continuity
##  correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reusegrp
## X-squared = 2.5545, df = 3, p-value = 0.4655
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4 
## 0.8029197 0.7463002 0.7450425 0.7766497
## 
##  4-sample test for equality of proportions without continuity
##  correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reusedisc
## X-squared = 2.9936, df = 3, p-value = 0.3926
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4 
## 0.7482993 0.6749522 0.6813472 0.6830357
## 
##  4-sample test for equality of proportions without continuity
##  correction
## 
## data:  reuse_counts$dem_reuseself out of reuse_counts$dem_reuseorg
## X-squared = 3.2616, df = 3, p-value = 0.353
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4 
## 0.7284768 0.6585821 0.6608040 0.6923077
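
As a cross-check on the output above, the reported p-values can be recomputed from the reported X-squared statistics. The Python sketch below uses the closed-form upper tail of a chi-squared distribution with 3 degrees of freedom (the df printed above). The helper `proportions_statistic` shows how such a statistic is formed from group counts; the underlying counts are not given in this report, so any counts passed to it are hypothetical.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def proportions_statistic(successes, totals):
    """Pearson X-squared for equality of k proportions, without continuity correction.

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    pooled = sum(successes) / sum(totals)
    stat = 0.0
    for s, n in zip(successes, totals):
        expected_yes = n * pooled
        expected_no = n * (1 - pooled)
        stat += (s - expected_yes) ** 2 / expected_yes
        stat += ((n - s) - expected_no) ** 2 / expected_no
    return stat

# X-squared values as printed in the three tests above.
for x in (2.5545, 2.9936, 3.2616):
    print(x, chi2_sf_df3(x))
```

All three p-values are well above 0.05, so none of the tests gives evidence that the proportions differ across the four groups.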

SECTION 3: Support x Researchers

Importance of information to decide whether to use secondary data (Q12/L13)

We investigate the differences in how students, researchers, and support professionals value information when deciding whether to use secondary data.

Importance of information to decide whether to use secondary data for students

The majority of students indicate they find data collection conditions and methodology, reputation of data source, metadata and documentation, how the data was processed and handled, and topic relevance to be factors that are important when deciding to use secondary data.

It appears that information such as data collection conditions and methodology, topic relevance, correct coverage, and how the data was processed and handled was the most important for students when they decided whether to use secondary data. Conversely, personally knowing the data creator was not important for nearly 45% of the students and less important for nearly 30% of students. This could indicate that students care somewhat more about the quality of the data itself than about the data creator or source.

Importance of information to decide whether to use secondary data for researchers

Researchers indicated reputation of data source, collection conditions and methodology, ease of access, how the data was processed and handled, and topic relevance to be important factors when deciding whether to use secondary data.

It appears that researchers care about information similar to what is important to students when deciding whether to use secondary data, with collection conditions and methodology being the most important, followed by how the data was processed and handled and topic relevance. As with the students, personally knowing the data creator is the least important factor, with around 25% of researchers indicating it is not important and around 23% indicating it is less important. Researchers appear to care more about the quality of the data itself than about the data creator, size, format, or original intent of the data.

Importance of information to decide whether to use secondary data for support professionals

Support professionals indicated data collection conditions and methodology, detailed metadata and documentation, ease of data access, how the data was processed and handled, and topic relevance to be important when deciding whether to use secondary data.

It appears that support professionals consider detailed metadata and documentation to be important information when deciding whether to use secondary data. Collection conditions and methodology, topic relevance, ease of access, and how the data was processed and handled are also important information for support professionals. While data quality is still important, support professionals appear to place more importance than students and researchers do on information that aids in accessing and understanding data. This makes sense, as support professionals likely find data for clients, who are the ones who actually use the data. While support professionals are responsible for finding good-quality data, they also consider information that would make accessing, understanding, or explaining the data easier.

Importance of helping establish trust in data (Q14/L16)

We investigate the differences in how students, researchers, and support professionals establish trust in data.

Importance of factors helping establish trust in data for students

Transparency in data collection methods, lack of errors in the data, and having prior usage of the data are the factors that are important for students when establishing trust in data.

It appears that students believe that data transparency and accuracy are important factors when establishing trust in data. This is evident from the percentage of students who indicated extremely important and important for lack of errors (over 30% and over 50%) and for transparency in data collection methods (nearly 50% and nearly 40%). A large percentage of students also indicate prior usage as important (nearly 60%) in establishing trust in data, but the percentage indicating it is extremely important is low compared to other factors that appear to be important. This could indicate that while having prior usage is helpful in establishing trust, it is not as important as the accuracy, transparency, and even ease of accessing data for students.

Importance of factors helping establish trust in data for researchers

Lack of errors, prior usage, transparency in data collection methods, and reputation of data source are all important factors for researchers when establishing trust in data.

It appears that researchers, like students, value factors dealing with data transparency and accuracy when it comes to establishing trust in data. The percentage of researchers who indicated transparency in data collection methods and lack of errors in data as extremely important (just over 50% and around 37%) surpasses the percentage of students who indicated the same. It appears that the reputation of the data source and ease of access are more important among researchers than among students in establishing trust in data. This could indicate that researchers, who could be producing more important or impactful work than the majority of students, value not only data transparency and accuracy but also the reputation of their source and the ease of accessing data when determining whether to trust the data.

Importance of factors helping establish trust in data for support professionals

Support professionals consider ease of access, transparency in data collection methods, and the reputation of data source to be important factors when establishing trust in data.

It appears that research support professionals place more importance on ease of access and reputation of the data source than researchers or students do. This is reasonable, as research support professionals may not be experts in the niche fields in which their clients may request data or aid. Thus, factors such as a reputable source or protected data could be interpreted as signs that the data is more trustworthy. Transparency in data collection methods remains the factor with the greatest percentage of respondents indicating extremely important, which could mean that all respondents (students, researchers, and support professionals) regard data collection methods as the most important factor when deciding whether to trust data. Across all respondents, a personal relationship with the data creator is indicated as not important in establishing trust in data.

Importance of establishing quality of data (Q15/L17)

We investigate below the differences in how students, researchers, and support professionals establish the quality of data.

Importance of factors helping establish quality in data for students

Factors such as lack of errors in data, data preparation, clarity, completeness, and reputation of data source are all important in establishing quality of data for students.

Similar to how students establish trust in data, it appears that factors regarding the data itself (lack of errors, data resolution or clarity, data completeness, and reputation of the source) play an important part for students in establishing data quality. The lack of errors in data is extremely important for nearly 50% of students and important for nearly 40% of students. The reputation of the data creator and data size are less significant for students: for each of these factors, over 20% indicated less important and over 10% indicated not important.

Importance of factors helping establish quality in data for researchers

Lack of errors, data preparation, clarity, and completeness all remain important in establishing quality of data for researchers.

It appears that many factors that were important to students, such as lack of errors, are also important for researchers when establishing the quality of data. Interestingly, researchers appear to value the reputation of the data creator and the reputation of the data source more than students do: over 45% of researchers indicated extremely important or important for reputation of data creator, compared to around 35% of students, and around 70% of researchers indicated extremely important or important for reputation of data source, compared to around 60% of students. The importance of data preparation appears to decrease among researchers compared to students (around 16% and 42% of researchers indicated extremely important and important respectively, compared to around 16% and 50% of students). This could indicate that students place greater value on how “clean” or “prepared” data is when establishing data quality, while researchers value the reputation of the data origin. It could also indicate that students are more willing to use data created or collected by less reputable sources.

Importance of factors helping establish quality in data for support professionals

Support professionals share some similarities with students and researchers but place less value on data size and ease of downloading and exploring when establishing quality in data.

Support professionals, like students and researchers, value factors such as lack of errors, data clarity, consistency, and completeness when establishing data quality. However, support professionals do not appear as certain that any specific factor is exceedingly important, in the way students and researchers were about lack of errors. Data size appears to be less valuable for support professionals when establishing quality of data, with over 50% indicating that factor is less important or not important. Overall, support professionals appear to be decided on consistency in formatting, data completeness, lack of errors, and resolution or clarity of the data as important factors in establishing data quality, and less certain about other factors, as seen in the greater percentages indicating somewhat important compared with students or researchers.

## 
## Data sharing is neither encouraged nor discouraged 
##                                                151 
##               Data sharing is somewhat discouraged 
##                                                 30 
##                Data sharing is somewhat encouraged 
##                                                440 
##               Data sharing is strongly discouraged 
##                                                 10 
##                Data sharing is strongly encouraged 
##                                                972 
##                         Don't know/ Not applicable 
##                                                 27
## 
## Data sharing is neither encouraged nor discouraged 
##                                        0.092638037 
##               Data sharing is somewhat discouraged 
##                                        0.018404908 
##                Data sharing is somewhat encouraged 
##                                        0.269938650 
##               Data sharing is strongly discouraged 
##                                        0.006134969 
##                Data sharing is strongly encouraged 
##                                        0.596319018 
##                         Don't know/ Not applicable 
##                                        0.016564417

## 
## Data sharing is neither encouraged nor discouraged 
##                                                289 
##               Data sharing is somewhat discouraged 
##                                                105 
##                Data sharing is somewhat encouraged 
##                                                612 
##               Data sharing is strongly discouraged 
##                                                 34 
##                Data sharing is strongly encouraged 
##                                                538 
##                         Don't know/ Not applicable 
##                                                 52
## 
## Data sharing is neither encouraged nor discouraged 
##                                         0.17730061 
##               Data sharing is somewhat discouraged 
##                                         0.06441718 
##                Data sharing is somewhat encouraged 
##                                         0.37546012 
##               Data sharing is strongly discouraged 
##                                         0.02085890 
##                Data sharing is strongly encouraged 
##                                         0.33006135 
##                         Don't know/ Not applicable 
##                                         0.03190184

## 
## Data sharing is neither encouraged nor discouraged 
##                                                361 
##               Data sharing is somewhat discouraged 
##                                                121 
##                Data sharing is somewhat encouraged 
##                                                618 
##               Data sharing is strongly discouraged 
##                                                 32 
##                Data sharing is strongly encouraged 
##                                                434 
##                         Don't know/ Not applicable 
##                                                 64
## 
## Data sharing is neither encouraged nor discouraged 
##                                         0.22147239 
##               Data sharing is somewhat discouraged 
##                                         0.07423313 
##                Data sharing is somewhat encouraged 
##                                         0.37914110 
##               Data sharing is strongly discouraged 
##                                         0.01963190 
##                Data sharing is strongly encouraged 
##                                         0.26625767 
##                         Don't know/ Not applicable 
##                                         0.03926380

## 
## Data sharing is neither encouraged nor discouraged 
##                                                404 
##               Data sharing is somewhat discouraged 
##                                                104 
##                Data sharing is somewhat encouraged 
##                                                546 
##               Data sharing is strongly discouraged 
##                                                 34 
##                Data sharing is strongly encouraged 
##                                                430 
##                         Don't know/ Not applicable 
##                                                112
## 
## Data sharing is neither encouraged nor discouraged 
##                                         0.24785276 
##               Data sharing is somewhat discouraged 
##                                         0.06380368 
##                Data sharing is somewhat encouraged 
##                                         0.33496933 
##               Data sharing is strongly discouraged 
##                                         0.02085890 
##                Data sharing is strongly encouraged 
##                                         0.26380368 
##                         Don't know/ Not applicable 
##                                         0.06871166
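
Each proportion table above is the corresponding count table divided by its total. As a minimal Python sketch, the following reproduces the proportions for the first pair of tables from the printed counts; the variable names are illustrative only.

```python
# Counts copied from the first table above.
counts = {
    "Data sharing is neither encouraged nor discouraged": 151,
    "Data sharing is somewhat discouraged": 30,
    "Data sharing is somewhat encouraged": 440,
    "Data sharing is strongly discouraged": 10,
    "Data sharing is strongly encouraged": 972,
    "Don't know/ Not applicable": 27,
}

# Divide each count by the total number of responses to get proportions.
total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, p in proportions.items():
    print(f"{label}: {p:.4f}")
```

The total is 1630 responses, and "strongly encouraged" accounts for roughly 59.6% of them, matching the second table.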

Data find open response (Q10a/L12a)

The following looks into open response questions Q10a and L12a, asking respondents to discuss how their process for finding data is different than the process for finding academic literature.

Support dataset

Conclusion:

The responses to the question asking how the process for finding data differs from the process for finding academic literature contained several words that appeared more often than others (aside from data and literature, which were part of the question). These included search, repositories, resources, and specific. The results may suggest that, unlike when finding literature, many respondents visit different repositories and must employ different search processes when finding data. There were no obvious patterns or differences between all the responses and those from only the respondents who answered yes.

Researchers dataset

Conclusion:

Similar to the support dataset, common words that were present in this open response question include ones such as search, specific, and sources. This may suggest that one of the reasons why researchers engage in a different process in finding data is because data requires specific sources (contacting data creator, specialized databases, etc) that are found through ways different from finding academic literature. For example, one respondent suggests that “finding literature is by google search, finding data has many more ways.”

Conclusion

From the support dataset, we learn that research support professionals were generally in the middle stages of their career, with a vast majority specializing in the natural and applied sciences and employed in either Europe or North America. The majority of researchers and research support professionals need observational or empirical data, and they use or need secondary data for either a new study or for teaching or training. Finding data primarily happens by actively searching online and through sources such as conversations with personal networks and mailing lists. Moreover, by analyzing the open response questions, we see that reliability, cost, date of data, and relevance influence researchers’ decision on whether or not to use secondary data.

From the researcher dataset, we found that an important factor in determining data sharing and (re)use behavior is the experience level of the respondents. Respondents with more experience indicated they find data themselves less often than respondents with fewer years of experience. Additionally, challenges to finding data pertaining to having the necessary resources and connections also decrease as years of experience increase. The challenge of data being in many different places increases with experience level, perhaps indicating that, with more access to available data, more experienced respondents find it more difficult to locate specific or targeted data. This could point to a broader issue with the organization of data beyond just the accessibility of data.

Students indicated that finding data was more challenging than did respondents who identified themselves as researchers. Additionally, more students than researchers indicated lack of personal networks and inaccessible data as challenges.

Limitations and future work

When it comes to the limitations of our analyses and the dataset, there are a couple that should be noted. First of all, the support dataset was relatively small, with only 47 respondents. This means that the limited number of responses to certain questions and analyses of various demographic variables must be taken into account. In addition, because of a low response rate to the survey, the potential for nonresponse bias must be considered, as responses may not be representative of the larger target population. As mentioned before, the dataset also consists of categorical data, limiting our ability to carry out further analyses or inferences.

This limitation aligns with the potential next steps, as we would welcome additional survey questions that gather quantitative responses from the researchers and support professionals. Additionally, a suggested future area of research could pertain to the organization of data in more specific sources and areas. From the analysis of data find challenges, we found that respondents with more experience had less difficulty accessing data than respondents with less experience, but a greater proportion of them indicated data to be in many different places. Another suggested area of future work is the challenges of finding data among students. From our analysis of ease of finding data among students and researchers, we discovered that students overall indicated finding data is difficult, with one of their challenges being that data is not accessible. Whether data is not accessible because students lack the necessary experience to obtain the data they desire, or because they are not aware of the data sources and data find resources available to help them, could be explored further.